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Years of sequence feature curation by UniProtKB/Swiss-Prot, PIR-PSD, NCBI-CDD, RefSeq and other database biocurators 
has led to a rich repository of information on functional sites of genes and proteins. This information along with variation- 
related annotation can be used to scan human short sequence reads from next-generation sequencing (NGS) pipelines for 
presence of non-synonymous single-nucleotide variations (nsSNVs) that affect functional sites. This and similar workflows 
are becoming more important because thousands of NGS data sets are being made available through projects such as The 
Cancer Genome Atlas (TCGA), and researchers want to evaluate their biomarkers in genomic data. BioMuta, an integrated 
sequence feature database, provides a framework for automated and manual curation and integration of cancer-related 
sequence features so that they can be used in NGS analysis pipelines. Sequence feature information in BioMuta is collected 
from the Catalogue of Somatic Mutations in Cancer (COSMIC), ClinVar, UniProtKB and through biocuration of information 
available from publications. Additionally, nsSNVs identified through automated analysis of NGS data from TCGA are also 
included in the database. Because of the petabytes of data and information present in NGS primary repositories, a platform 
HIVE (High-performance Integrated Virtual Environment) for storing, analyzing, computing and curating NGS data and 
associated metadata has been developed. Using HIVE, 31 979 nsSNVs were identified in TCGA-derived NGS data from breast 
cancer patients. All variations identified through this process are stored in a Curated Short Read archive, and the nsSNVs 
from the tumor samples are included in BioMuta. Currently, BioMuta has 26 cancer types with 13 896 small-scale and 
308986 large-scale study-derived variations. Integration of variation data allows identifications of novel or common 
nsSNVs that can be prioritized in validation studies. 

Database URL: BioMuta: http://hive.biochemistry.gwu.edu/tools/biomuta/index.php; CSR: http://hive.biochemistry.gwu. 
edu/dna.cgi?cmd=csr; HIVE: http://hive.biochemistry.gwu.edu 
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Introduction 

Rapidly evolving sequencing technologies have exponen- 
tially increased the output of genomics data (1, 2), which 
has led to revolutionary discoveries in cancer biology and 
other biological sciences (3-5). The field of biomarker dis- 
covery has benefited tremendously from this technology, 
with hundreds and thousands of variations being asso- 
ciated with diseases from single studies (6-8). However, 
there are several challenges to analyzing the vast amount 
of data (Big Data) that next-generation sequencing (NGS) 
technologies are creating, and all laboratories do not have 
the resources to perform such large-scale studies (9, 10). 
Therefore, it is not surprising that many researchers still 
publish results from studies that involve less expensive gen- 
otyping technologies producing smaller amounts of data. 
Such smaller studies can sometimes help validate results 
from larger projects, thereby providing unprecedented 
levels of cooperation between scientists engaged in large- 
and small-scale studies. 

The forementioned cooperation is difficult because gen- 
omics data are large, varied, heterogeneous and widely 
distributed. Extracting and converting these data into rele- 
vant information and comparing results across studies have 
become an impediment for personalized genomics (11). 
Additionally, because of the various computational bottle- 
necks associated with the size and complexity of NGS data, 
there is an urgent need in the industry for methods to 
store, analyze, compute and curate genomics data. There 
is also a need to integrate analysis results from large pro- 
jects and individual publications with small-scale studies, so 
that one can compare and contrast results from various 
studies to evaluate claims about biomarkers. 

Databases are mainly of two types: primary databases 
that comprise raw data and secondary databases that ex- 
tract relationships and filter the information available from 
the primary databases and add annotations that are gen- 
erated either manually or automatically. One of the prob- 
lems often faced by end users of Big Data is the lack of 
curated information in primary NGS data repositories, 
such as NCBI Short Read Archive (12) and The Cancer 
Genomics Hub (https://cghub.ucsc.edu/). It is expected that 
curated secondary databases will help organize Big Data 
and make it more user-friendly, similar to what secondary 
databases like RefSeq (13), UniProtKB/Swiss-Prot (14) and 
PIR-PSD (15) have done and are still doing for GenBank 
(16). Coherent organization of analysis results of NGS 
data will also allow use of higher-level databases such as 
Pfam (17), PIRSFs (18), PANTHER (19), KEGG (20) and others 
that group objects into functional groups and provide in- 
formation on biological networks and processes. 

One of the major thrusts of NGS is identification of 
human genetic variations, which is used to better under- 
stand human diseases (21-23). Although computational 



approaches are available to predict which variants are po- 
tentially deleterious and associated with disease (24-27), 
the first steps involved in the process, such as mapping of 
short sequence reads to human reference and identifica- 
tion of single-nucleotide variations (SNVs), are computa- 
tionally expensive, and few investigators have the 
resources or expertise to perform analysis that involves 
downloading terabytes of data from databases and pro- 
cessing and computing on them (10, 28). Furthermore, vari- 
ations that are associated with cancer are currently 
available from diverse databases that use different work- 
flows, and it is challenging to compare results from differ- 
ent sources. Many of these databases and projects have 
specific focus. The cBio cancer genomics portal (29) mostly 
consists of data from The Cancer Genome Atlas (TCGA), and 
its goal is to provide an integrated view of cancer genomics 
data from TCGA and other large projects. International 
Cancer Genome Consortium (ICGC) data portal allows 
member institutions to manage and maintain their own 
databases locally and also allows them to present data 
and information to the users through a single portal (30). 
UniProt provides manually curated cancer mutation data 
that are available from publications (14), and resources 
such as HGMD (31) have added to such data in the past 
few years. The Catalogue of Somatic Mutations in Cancer 
(COSMIC) (32) focuses on curating information on somatic 
mutations in human cancer largely from Cancer Genome 
Project at the Sanger Institute, UK, TCGA and other large- 
scale published projects. Other than UniProt, to the best of 
our knowledge, no group is currently engaged in extracting 
data through extensive manual curation of information 
available in publications and providing it freely. It is well 
known that such curation is hard to perform as expounded 
by Bairoch et al. in their article 'Swiss-Prot: juggling be- 
tween evolution and stability' (33). For NGS data without 
the availability of clear standards in terms of data or ana- 
lysis, it is even more challenging, and it is clear that not one 
group can tackle this challenge alone. There is a pressing 
need to develop data and computational standards as ele- 
gantly outlined in the recent Nature Genetic editorial (34). 
One of the questions posed in the editorial outlines the 
current state of one of the most widely used NGS pipelines 
'\i I run the same sequence reads from a single cancer 
genome through this pipeline of assembly and variant call- 
ing twice, can I expect 70-80% concordance between the 
results?' It is clear something needs to be done, and recent 
publications and efforts by the Human Genome Variation 
Society show that there is a significant interest in the re- 
search community to solve these problems (35). 

In view of some of these difficulties, BioMuta has been 
created to integrate cancer-related non-synonymous single- 
nucleotide variations (nsSNVs) from various sources, which 
are associated with specific cancer types and publications. 
Such integration, we believe, will assist in the development 
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of standards by allowing direct comparison of data pro- 
vided by different groups. BioMuta is an integrated se- 
quence feature database that provides a framework for 
automated and manual curation of features, such as 
nsSNVs. Sequence feature information in BioMuta is col- 
lected not only from COSMIC (36), ClinVar (http://www. 
ncbi.nlm.nih.gov/clinvar/) and UniProtKB (14) but also 
through active biocuration from publications and auto- 
mated analysis of NGS data from sources such as TCGA 
using a novel data analysis platform HIVE (High-perform- 
ance Integrated Virtual Environment) (37, 38). Although 
databases such as COSMIC add large-scale data to their 
databases (97 publications associated with all nsSNVs in 
COSMIC), our goal is to manually curate data from small- 
scale studies that, to the best of our knowledge, is not the 
focus of any of the current resources other than UniProt 
(118 publications associated with cancer). It is important 
to note that UniProt curation effort is more comprehensive 
than just curating cancer biomarkers; hence, we believe 
that our work extends the UniProt effort. We believe that 
computationally and manually curated and integrated data 
and metadata will provide unprecedented value to biolo- 
gical researchers by making available details from multiple 
studies (big and small) that ordinarily a user would not be 
aware of (thereby helping scientists the same way that 
Model Organism Databases, RefSeq, UniProtKB/Swiss-Prot 
and other curated databases have been doing for years). 

Biocuration of data obtained from primary databases re- 
quires a framework for analyzing, annotating and comput- 
ing, which has led to the development of several curation 
tools at all major bioinformatics institutes. Many of these 
biocuration tools are geared toward analysis of small-scale 
data such as small-number genes or proteins and therefore 
are not optimal for analysis of NGS data. In an elegant 
article 'Big data: the future of biocuration', Doug Howe 
and colleagues have pointed out how curation always 
lags behind data generation in funding, development and 
recognition (39). The authors also provide three urgent ac- 
tions to tackle this problem: (i) authors, journals and cur- 
ators should work together; (ii) facilitate community-based 
curation efforts; and (iii) support for scientific curation as a 
professional career. We would like to add a fourth action 
stating that there is an urgent need to also develop novel 
platforms for biocuration of Big Data. Software and hard- 
ware that have worked well for the past decades can no 
longer adequately support the needs of the modern cur- 
ator who is analyzing vast amounts of data. In this article, 
we describe how time-tested curation of sequence features 
through reading papers supplemented with data integra- 
tion from diverse sources and also through the analysis of 
NGS data can help create a comprehensive curated data- 
base of cancer-related nsSNVs, which can be of immediate 
use to the community. We subscribe to the thoughts ex- 
pressed by Howe et al. that biocuration provides an 



organized approach in translating the recent explosion of 
biological data into meaningful results, and curated data- 
bases are essential for novel discoveries in biomedical 
research (39). 

BioMuta data sources 

Data sources for BioMuta are shown in Figure 1. Unless 
otherwise noted, all accessions and identifications (IDs) 
are mapped using ID Mapping table (40), followed by pair- 
wise alignment and mapping of sequence positions with 
methods that have been used previously (24, 41). Only 
those nsSNVs that could be mapped to UniProtKB/Swiss- 
Prot human protein that have the Complete Proteome key- 
word tag are retained in BioMuta. 

Although there are several efforts worldwide to collect 
and disseminate cancer genomics variation data, it is clear 
that the data are heterogeneous and it is difficult for users 
to compare and contrast data from different data sources. 
Different algorithms are used to identify variations, and 
also, to the best of our knowledge, biocuration of variation 
data from publications on cancer biomarkers is limited. In 
all, 118 publications were retrieved from UniProt and 97 
from COSMIC. The BioMuta project through literature 
mining-assisted curation has already added 85 publications 
that are not present in either COSMIC or UniProt. In add- 
ition to this, the complementary Curated Short Read arch- 
ive (CSR) project provides additional mutation data to 
BioMuta from TCGA. Future plans include addition of 
data from ICGC and other cancer genomics projects as 
data linked to publications becomes available from these 
resources (criteria for inclusion of external data in BioMuta 
include association of record with a publication). 

Catalogue of Somatic Mutations in Cancer 

The file, CosmicWGS_MutantExport_v65_220513, which 
contains all coding point mutations, was downloaded 



Curated Short Reads 




Figure 1. nsSNV data from various sources are collected, fil- 
tered and mapped to UniProtKB/Swiss-prot-defined complete 
human proteome and integrated into BioMuta. 
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from COSMIC (42). The first step involved filtering out 
entries without a PubMed identification (PMID). Because 
the cancer descriptions in COSMIC are complex, the 
cancer description columns (primary site, site subtype, pri- 
mary histology and histology subtype) were manually 
checked and converted into TCGA cancer categories 
(https://tcga-data.nci.nih.gov/tcga/tcgaHome2.jsp). A total 
of 283 895 nsSNVs of 904143 variations were retrieved 
from the COSMIC file. 

ClinVar 

ClinVar (http://www.ncbi.nlm.nih.gov/clinvar/) is a database 
that provides information about sequence variations and 
associations to human health. Tables were downloaded 
from the ClinVar ftp site (ftp://ftp.ncbi.nlm.nih.gov/pub/ 
clinvar/). A total of 4590 cancer-related variations with 
PMIDs were retained. The majority of the records from 
ClinVar were filtered out because they either did not 
have PMID or the cancer type was not easily discernible. 

UniProtKB 

UniProt (14) provides comprehensive curated protein se- 
quences and functional information. All proteins that con- 
tained cancer-related keywords (cancer, carcinoma glioma, 
blastoma, leukemia, melanoma, adenocarcinoma, lymphoma 
and tumor) in the sequence feature (FT) line were extracted 
from UniProt KB/Swiss-Prot-defined human complete prote- 
ome, and 2279 manually verified variations were added to 
BioMuta. UniProt does not provide genomic location; hence, 
for these variations genomic locations are not provided. 

Manual curation 

By using key terms [cancer, single-nucleotide polymorphism 
(SNP), biomarker, variant, variation, etc.], articles from 
PubMed (http://www.ncbi.nlm.nih.gov/pubmed) were 
retrieved and manually curated to obtain variation infor- 
mation. PMIDs not present at the time of curation in 
COSMIC, ClinVar and UniProtKB were selected for manual 
curation. A total of 1 39 sites from 85 articles were added to 
BioMuta through this process. 

Curated Short Read archive 

Currently there are thousands of large-scale NGS data from 
patients and cell line samples that are available from pri- 
mary short read data repositories such as TCGA (http://can 
cergenome.nih.gov/) and NCBI Short Read Archive (43) and 
listed through dbGaP (44). We expect that integrated ana- 
lysis of these data will lead to novel discoveries. For ex- 
ample, NGS data from TCGA provides a rich source of 
sequence data that can be mined to extend and comple- 
ment mutation and SNV information available from dbSNP, 
UniProt, COSMIC and other variation databases. We intend 
to identify all nsSNVs from representative samples from all 
data sets that have matched case and controls and also 



have exome and RNA-Seq data. Analysis of these subsets 
of samples provides a rich source for biological discovery. 
All variation data can be further analyzed using SNVDis, 
which is a proteome-wide SNV distribution analysis tool 
(24). For this study, NGS data from 20 breast cancer patients 
(22 tumor samples and 33 normal) were analyzed to iden- 
tify nsSNVs. Results from this analysis and additional infor- 
mation such as phenotypic information were curated and 
added to a CSR. A total of 31 979 nsSNVs of 291 803 SNVs 
from tumor samples were added to BioMuta. Direct access 
to CSR is available at http://hive.biochemistry.gwu.edu/dna. 
cgi?cmd=csr. Users can search for variations present in 
tumor and normal samples using gene or protein accession 
numbers and view whether the variation is already present 
in dbSNP. Searching using TCGA IDs is also supported. The 
CSR curation platform is supported by HIVE, which is 
described in the section below. 

HIVE for biocuration 

A sophisticated IT framework is required for analyzing, 
annotating and computing the vast amounts of data 
generated using NGS technologies. HIVE provides such a 
platform and is used to analyze NGS data. HIVE is a bio- 
computing operating system, which provides the ideal 
backbone to integrate modular software into a data analy- 
tics backbone. The HIVE architecture provides a highly par- 
allel processing environment, which allows optimal 
compatibility and performance for both native and indus- 
try-standard tools. All algorithmic services and tools manip- 
ulate data from three sources: data loaded preliminary into 
the system, data provided by the user during a computa- 
tion inquiry or data computed during a previous computa- 
tion. HIVE has an ensemble of parsers, loaders, converters 
and validators for all industry-standard biological data for- 
mats (such as sequences, alignments, profiles). All data in 
the system are available for downloading in a variety of 
industry-accepted data formats (fasta, SFF, fastq, BAM, 
SAM). The primary step in many genomic workflows is to 
align and map short reads to a reference genome. There 
are several software programs with their own alignment 
algorithms. The different algorithmic approaches of each 
tool create computational trade-offs in speed, accuracy 
and performance to optimize the detection of variants in 
the alignment (45, 46). Currently, HIVE has the following 
alignment tools integrated and parallelized: HIVE-hexagon 
(native HIVE alignment tool), Bowtie (47) and BWA (48). 
After alignment of short reads to a reference genome 
with any of the alignment tools, variants can be identified 
through comparison of the sample genome with the refer- 
ence genome. 

Currently, the following protocol is in use in the CSR pro- 
ject (a BioMuta data source) to identify variations: Short read 
data are obtained from TCGA (http://cancergenome.nih.gov/) 
via The Cancer Genomics Hub data portal (https://cghub.ucsc. 
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edu/). The reference used in the alignment is the hg19, 
GRCh37 Genome Reference Consortium Human Reference 
37 (GCA_000001 405.1) downloaded from UCSC (http:// 
hgdownload.cse.ucsc.edu/goldenPath/hg19/chromosomes/). 
UniProtKB protein amino acid position and ID mapping is 
done using SNVDis and ID Mapping services (24, 40). After 
the raw SNV data are generated using Bowtie (47) and SAM 
tools (49), filters are used to select high-quality SNVs that are 
of desirable coverage (>10 reads) and quality score (>20). 
The filtration process also rejects detected SNVs falling out of 
the exome regions. Results of the variation profiling tool can 
be further evaluated manually using HIVE native displays as 
shown in Figure 2. 

BioMuta content 

To ensure usability of the database, care is taken to verify 
that all SNVs in BioMuta have the following characteristics: 



(i) has PMID for all imported data, (ii) is an nsSNV, (iii) can 
be mapped to UniProtKB/Swiss-Prot-defined human 
proteome and (iv) has either gene/protein or genome coor- 
dinates. SNVs associated with PMIDs that report <1000 var- 
iations are considered small-scale study variations, and 
those that are associated with PMIDs that report >1000 
variations are considered large-scale studies and are 
hence marked as large-scale study variation. Literature 
mining variations are those that are automatically 
extracted through literature mining procedures. Such var- 
iations are currently not available to the public. Table 1 
provides detailed statistics of the number of variations 
obtained from different databases. The majority of the var- 
iations are obtained from COSMIC and CSR-TCGA. Through 
manual curation of 85 publications, 1 39 sites were added to 
BioMuta. Adding such manually curated records in BioMuta 
is one of the top priorities of the project. Table 2 provides 
an overview of example search parameters, number of 
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Figure 2. HIVE interface showing result obtained from SNV profiling of short sequence reads mapped to nucleotide sequence 
surrounding a variation site. (A) Overall coverage result with the 121485241 position, showing variation. (B) Reads mapped to 
the reference sequence with the column of interest are highlighted in yellow. (C) Only variations are shown in this panel. 
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Table 1. Twenty-six cancer types and 322 882 (small-scale: 13 896; large-scale: 308 986) associated variations in BioMuta 



Cancer types^ COSMIC UniProt ClinVar Manual CSR-TCGA 



Small- Large- Small- Large- Small- Large- Small- Large- Small- Large- 
scale"^ scale'^ scale scale scale scale scale scale scale scale 



Lung (LUAD) 


121 


80 006 


105 








Colon (COAD) 


486 


68 249 


235 






20 


Breast (BRCA) 


176 


7386 


342 


1 


3314 


16 31979 


Esophageal (ESCA) 


43 


25 980 








1 


Ovarian (OV) 


1229 


16411 


31 




1276 


4 


Skin (SKCM) 


496 


17 041 








2 


Prostate (PRAD) 


77 


10 920 








1 


Head and neck (HNSC) 


716 


11 838 








1 


Rectum (READ) 




9760 








10 


Lymphoid (DLBC) 


1710 


7006 










Adrenocortical (ACC) 


1000 


4515 




1 






Pancreatic (PAAD) 


896 


3164 


3 








Brain (LGG) 


773 


2383 










Uterine (UCEC) 


490 


1414 


1 








Kidney (KIRC) 


893 


115 










Liver (LIHC) 


1224 


1023 


14 






3 


Glioblastoma (GBM) 


776 












Acute myeloid (LAML) 


409 










8 


Thyroid (THCA) 


513 




7 






3 


Bladder (BLCA) 


450 




2 








Lung (LUSC) 




256 










Stomach (STAD) 


89 












Kidney renal (KIRP) 


33 




42 








Kidney chromo (KICH) 


57 












Non-small lung (NSCLC) 






4 






8 


Cervical (CESC) 






1 






5 


Other"^ 


5319 













^LUAD, lung adenocarcinoma; COAD, colon adenocarcinoma; BRCA, breast invasive carcinoma; ESCA, esophageal carcinoma; OV, ovarian 
serous cystadenocarcinoma; SKCM, skin cutaneous melanoma; PRAD, prostate adenocarcinoma; HNSC, head and neck squamous cell 
carcinoma; READ, rectum adenocarcinoma; DLBC, lymphoid neoplasm diffuse large B-cell lymphoma; ACC, adrenocortical carcinoma; 
PAAD, pancreatic adenocarcinoma; LGG, brain lower grade glioma; UCEC, uterine corpus endometrial carcinoma; KIRC, kidney renal clear 
cell carcinoma; LIHC, liver hepatocellular carcinoma; GBM, glioblastoma multiforme; LAML, acute myeloid leukemia; THCA, thyroid 
carcinoma; BLCA, bladder urothelial carcinoma; LUSC, lung squamous cell carcinoma; STAD, stomach adenocarcinoma; KIRP, kidney 
renal papillary cell carcinoma; KICH, kidney chromophobe; NSCLC, non-small cell lung cancer; CESC, cervical squamous cell carcinoma 
and endocervical adenocarcinoma. 

"^Small-scale — SNVs associated with publications that report <1000 SNVs. 

"^Large-scale — SNVs associated with publications that report >1000 SNVs or SNVs identified using computational pipelines from existing 
NGS data. 

^Cancer types not specified or well defined. 



articles retrieved and overall nunnber of articles that were 
found to include variation infornnation that can be included 
in BioMuta. More specifically, different connbinations of 
search ternns are used and nnultiple search results are conn- 
bined to create a nonredundant set of PMIDs. Title and 
abstracts are read to extract articles of interest. Titles/ 
abstracts with gene/protein nanne, cancer type, tunnor 



type, variation site, nnutation- and bionnarker-related 
words are prioritized for curation. The next step involves 
reading the nnanuscript and any relevant Supplennentary 
Tables to retrieve variation-related infornnation. Finally, 
accession nunnbers, nnutation and nnutation positions are 
verified, and attennpts are nnade to nnanually check and 
include nnissing infornnation such as chronnosonnal location. 
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Table 2. Example PubMed search terms and results 



Search terms 


Total 


Positive 




articles^ 


articles^ 


SNP, biomarker, cancer 


702 


60 


Biomarker, cancer, single-nucleotide- 


1986 


43 


polymorphism 






Polymorphism, biomarker, cancer 


5215 


20 


SNP, exon, cancer 


394 


16 


Gene name^ cancer, SNP 


20 


4 


Total 




143'' 



^Total number of articles retrieved using the search terms. 
•^Articles from which data were extracted for inclusion in BioMuta. 
'Targeted curation of specific genes, e.g. MTA1, MTA2, SULF2, 
SHBG, DLX4, etc. 

''Articles and annotations that pass validation step are retained. 



accession nunnber/s and valid HGNC gene synnbols. Curation 
results are cross-checked by curators and through the vali- 
dation process. 

Users can download the entire BioMuta table or browse 
the database by searching for records using gene nannes 
and UniProtKB or RefSeq accessions. Search results include 
a graphical representation of the nnutations and a table 
that can be downloaded in tab-del innited fornnat and 
further analyzed by Microsoft Excel or sinnple scripts. 
Users have the ability to select a specific row in the results 
table and send connnnents to BioMuta curators. This type of 
direct feedback will help us innprove database content. All 
records are linked to the Early Detection Research Network 
(EDRN) Knowledge Environnnent through its online public 
portal (50). EDRN is a distributed knowledge network that 
integrates cancer bionnarker research results fronn across 
the network. This includes the integration of annotations 
regarding bionnarkers under study with the results fronn 
those studies that can be used for analysis. The bionnarkers 
thennselves are annotated fronn studies perfornned by the 
EDRN and linked to the publications and external protein 
and genonnic databases. The annotations include infornna- 
tion about the success of the bionnarkers that have been 
studied. The EDRN Knowledge Environnnent allows for 
external linking to the specific data captured within the 
systenn. This has allowed for BioMuta and EDRN to be 
linked together through specific attributes of the bionnar- 
kers, including connnnon gene nannes provided by HGNC 
(51), which are annotated with the bionnarkers within the 
EDRN knowledge systenn and bionnarker database. The inte- 
gration of these highly curated systenns beconnes plausible 
given the adoption of connnnon identifiers and the pronno- 
tion of online portals and web services. 

BioMuta has data fronn various sources, and it is possible 
that Sonne of these databases nnight contain errors in ternns 
of the genonnic coordinates and/or the gene/protein 



positions. To reduce the propagation of these types of 
errors, we have validation procedures to check the table. 
To address the heterogeneity of the different variation 
data sources, all variant records are unified via the 
UniProtKB/Swiss-Prot hunnan proteonne set by providing 
each variant a UniProtKB accession nunnber and position. 
To achieve this, all variations with genonnic coordinates are 
first nnapped to genes/transcripts using SeattleSeq services 
(http://snp.gs.washington.edu/SeattleSeqAnnotation137) 
and then nnapped to UniProtKB protein accession and posi- 
tion using nnethods described previously (24). Briefly, the 
nnapping process includes nnapping of RefSeq accessions to 
UniProtKB accessions using ID nnapping services provided by 
Protein Infornnation Resource and UniProt (40), followed by 
pairwise alignnnent of the sequences to nnap the positions. 
For all the records that cannot be correctly nnapped to the 
coding region or if the annino acid does not nnatch the 
UniProtKB-defined proteonne or for nucleotides if they do 
not nnatch the RefSeq nucleotide for that position, the 
entire row is discarded after nnanual evaluation of the error. 

BioMuta utility 

One of the innnnediate applications of the BioMuta project is 
to evaluate variations that are obtained fronn various sources 
through biocuration and thereby provide ways to prioritize 
variations for further experinnental evaluation by the EDRN 
connnnunity and others. The evaluations can be perfornned by 
both connparing and contrasting nnutations data fronn differ- 
ent cancer types and/or fronn different studies. Additional 
evaluation of nnutations can also be perfornned by interro- 
gating NGS data fronn TCGA or ICGC or other projects to see 
whether specific nnutations are present in certain cancer 
types and what their frequency is (Figure 3). 

One of the goals of researchers is assessing the func- 
tional innpact of variations. Figure 4a provides exannple ana- 
lysis results of how the BioMuta data can be used to better 
understand the functional innpact of nsSNVs fronn different 
cancer types. For this analysis, all the nsSNVs were nnapped 
to functional sites that were obtained fronn UniProtKB 
sequence feature annotation. Based on this analysis, we 
notice that a large nunnber of posttranslational nnodifica- 
tions (PTMs) and active and binding sites are affected by 
nsSNVs. It is interesting to note that for breast cancer, there 
is a high nunnber of nsSNVs that affect N-linked glycosyla- 
tion sites. To find out whether certain types of PTM or 
other functional sites are resistant to variations, P-values 
were calculated based on nnethods described earlier (24, 
52), to estinnate the significance between observed and 
expected nunnbers. The results indicate, as expected, for 
several of the functional sites, observed variations are sig- 
nificantly lower than the calculated expected values. More 
specifically — acetylation: observed 105, expected 224.28, 
P-value 4.53E-19; active site: observed 59, expected 93.18, 
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Figure 3. BioMuta data flow and utility in evaluating variations obtained from various cancers. 



P-value 9.89E-05; binding site: observed 102, expected 
123.69; C-linked glycosylation: observed 1, expected 1.68; 
y-carboxyglutamic acid: observed 5, expected 2.61; 
methylation: observed 45, expected 26.09, P-value 2.42E- 
07; N-linked glycosylation: observed 551, expected 467.78, 
P-value 9.60E-05; 0-linked glycosylation: observed 28, 
expected 76.64, P-value 1.54E-10; palmitoylation: observed 
1, expected 4.52; phosphorylation: observed 1083, expected 
2325.07, P-value 1.71E-183; prenylation: observed 0, 
expected 2.00; S-nitrosylation: observed 7, expected 22.25, 
P-value 1.65E-04; sulfation: observed 1, expected 1.68; 
sumoylation: observed 7, expected 19.19, P-value 1.33E- 
03; ubiquitylation: observed 257, expected 668.12, P-value 
3.78E-74 (P-values >0.05 are not shown). Data from specific 
cancer types were also analyzed to evaluate whether cer- 
tain PTMs are more affected by certain types of cancer 
(Figure 4b). The majority of functional sites analyzed 
seem to be protected from mutation (significantly less 
observed variations than expected). It is hard to explain 
why for certain cancer types some of the functional sites 
appear to be less protected. More data are required to 
evaluate these trends. All the variations obtained in our 
pipeline are also integrated into SNVDis (24). To facilitate 
evaluation of the effects of variations, PolyPhen-based (6) 



predictions are also included in the BioMuta table. SNVDis 
provides users with applications that can be used to evalu- 
ate the distribution of nsSNVs on protein functional sites, 
domains and pathways at the entire proteome level. Such 
proteome-wide analysis is complementary to functional 
impact analysis using methods such as PolyPhen (27) and 
SIFT, (26) and similar algorithms. 

Integration of data, as the one performed in this study, 
allows identification of genes that have high level of varia- 
tions. From the small-scale category, the top five genes in 
terms of number of unique nsNSVs-PMID pairs are TP53, 
PBRM1, MENI, ARID1A and NF1. For the large-scale cate- 
gory, the genes are BRCA2, BRCA1, TP53, TTN and 
CACNA1C. In search of variants that are recorded in more 
than one database, the variants that have same UniProtKB 
accession, amino acid position and variation were identified. 
There are 518 variants found in two or more data sources 
(Supplementary Table SI). We expect this overlap to 
increase as more data from other cancer-genomics studies 
are included. Of these 518 overlapping variants, the CSR 
database contributes the most to this number of shared var- 
iants (>95%), thereby showcasing the utility of evaluating 
variations by analyzing TCGA data. Another interesting fact 
is that almost 90% (23 of 26) of the literature-based 
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Figure 4. Loss of functional sites (PTM sites, active and binding sites). (A) Top six cancer types with the highest number of records 
in BioMuta. Lung adenocarcinoma (LUAD), colon adenocarcinoma (COAD), breast invasive carcinoma (BRCA), esophageal carci- 
noma (ESCA), ovarian serous cystadenocarcinoma (OV) and skin cutaneous melanoma (SKCM). (B) Statistical analysis of loss of 
functional sites show that for some cancer type-specific functional sites are less susceptible to variation (colored graph area 
almost touching the perimeter — where perimeter represents P-value close to 0). 



manually curated variation overlaps are from CSR and not 
from COSMIC, suggesting that CSR database even with a 
limited number of patient data might already be useful in 
evaluating published cancer-related variations. 

Our overarching goal is to provide whole-genome and 
exome analysis capabilities through HIVE or similar 



platforms where users can upload short read sequences 
and map them to the human reference genome followed 
by flagging of sites that are impacted by variations and are 
already reported by other studies. Such analysis will allow 
researchers to quickly evaluate personal genomes of 
patients or study subjects. 
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Future perspective 

Efforts directed toward creating databases such as CSR, 
ClinVar, RefSeq, UniProt, HGMD (31) PharmGKB 54), 
IntOGen (7), ICGC (8) and others, which provide informa- 
tion on variations and disease or other phenotypic details, 
will provide methods connecting genomic alterations with 
clinical parameters. These efforts are vital for using the full 
potential of NGS technologies (3), leading to novel discov- 
eries that will translate to diagnostic and therapeutic tar- 
gets (4, 54). All of these databases will benefit from 
additional variation sites extracted through the biocuration 
of information from publications. Our future plans include 
automated literature mining methods that will provide tar- 
geted extraction of publications that can be used to anno- 
tate major cancer genes. We also intend to provide 
community annotation tools to cancer biologists so that 
they can add notes related to experimental validation of 
the variations and the possibility of using them as diagnos- 
tic or prognostics markers. This information can be used by 
curators to provide additional structured information to 
these entries. Engaging the entire scientific community in 
community annotation efforts has been difficult (51, 55). 
Therefore, we will initially focus on involving the EDRN 
community and other specific cancer researcher groups 
such as Alliance of Glycobiologists (http://glycomics. cancer, 
gov/) in our initial community annotation efforts. 

Access 

BioMuta and CSR are updated at least once every 4 months. 
Access to all data is available without any login require- 
ments. To use HIVE'S computationally intensive tools, users 
need to register at http://hive.biochemistry.gwu.edu/dna. 
cgi?cmd=userReg. Temporary login is provided for evalua- 
tion purposes such as browsing the interfaces or viewing 
example analysis results. HIVE login URL: http://hive.biochem- 
istry.gwu.edu/dna.cgi?cmd=login&follow=home; evaluation 
userid: xlhive@yahoo.com; password: pilotHive5. Users can 
also install HIVE on their own hardware or use HIVE-in-a- 
box, which is a low-cost alternative to analyze NGS data 
using predetermined workflows. For additional details, 
users are encouraged to contact the HIVE team (http://hive. 
biochemistry.gwu.edu/dna.cgi?cmd=contact). 

Supplementary Data 

Supplementary data are available at Database Online. 
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